HDFS Asset URI: allow empty netloc for Hadoop fs.defaultFS#68022
HDFS Asset URI: allow empty netloc for Hadoop fs.defaultFS#68022stegololz wants to merge 2 commits into
Conversation
19e5a39 to
4536410
Compare
|
I fixed the tests but i'm not happy with the current implementation. The small caveat we have now that you want to know about: when the input is >>> from urllib.parse import urlsplit, urlunsplit
>>> urlsplit("hdfs:///apps/x") == urlsplit("hdfs:/apps/x")
True
>>> urlunsplit(urlsplit("hdfs:///apps/x"))
'hdfs:/apps/x'urlsplit cannot distinguish the two forms: they parse to identical SplitResult instances with an empty netloc and urlunsplit cannot emit // for an empty authority; it omits the // entirely. Since the provider hook only sees and returns a SplitResult, and the surrounding _sanitize_uri in task-sdk calls urlunsplit on the result, there is no way for the provider on its own to preserve the /// form in the stored URI. Round-trip identity is still stable: both hdfs:///apps/x and hdfs:/apps/x normalize to hdfs:/apps/x, so asset matching remains consistent across re-parses. For the OpenLineage conversion, an empty netloc produces a hdfs:// namespace, which is what consumers expect for fs.defaultFS-resolved paths. If preserving the literal hdfs:///path form in storage is considered worth the additional surface area, it could be done with a small opt-in in _sanitize_uri (for example, a normalizer.preserve_empty_authority = True attribute set when we want the empty-authority form retained, and a corresponding string-level fix-up after urlunsplit). That keeps the behavior change scoped to opt-in normalizers (only HDFS would set it today), and leaves others untouched. I deliberately left this out of the current PR to keep the change minimal and provider-local. Happy to follow up with that task-sdk change if someone thinks it is worth. |
The hdfs asset URI sanitizer rejected hdfs:///path as missing a namenode host. Per RFC 3986 the authority component is optional; per Hadoop semantics an empty authority means 'resolve via fs.defaultFS from core-site.xml' — i.e. hdfs:///apps/x is the canonical form for jobs that must not hard-code a namenode. Relax sanitize_uri to require only a non-empty path, and add positive + negative parametrized tests covering the default-fs form and the corresponding OpenLineage conversion.
4536410 to
5f6d16c
Compare
|
Update after digging into normalization. Dropping the host check alone is not enough: the SDK normalizes asset URIs with Current branch: Register In current behaviour |
| # Preserve the empty-authority "hdfs:///path" (fs.defaultFS) form through urlunsplit, like "file". | ||
| if "hdfs" not in urllib.parse.uses_netloc: | ||
| urllib.parse.uses_netloc.append("hdfs") |
There was a problem hiding this comment.
I don’t think it’s a good idea to do this; uses_netloc is undocumented and should be considered private.
When is this needed?
There was a problem hiding this comment.
I can drop the line or implement a similar behaviour somewhere else, it is not mandatory.
The SDK normalizes asset URIs through urlunsplit here:
airflow/task-sdk/src/airflow/sdk/definitions/asset/__init__.py
Lines 183 to 185 in e5bf1e3
That makes hdfs:///apps/x normalize to hdfs:/apps/x. I'd like to keep the canonical hdfs:/// form, which is why the if is there.
Caveat: as long as I stay on the provider, I won't be able to make a distinction between hdfs:/path and hdfs:///path (both end up as hdfs:///path), because the normalizer only sees the parsed result where both already have an empty netloc. To keep them distinct I'd have to also modify the Task SDK.
What would be the preference here?
Summary
Relax
airflow.providers.apache.hdfs.assets.hdfs.sanitize_urito accept the canonicalhdfs:///pathform (empty netloc). Previously rejected withValueError: URI format hdfs:// must contain a namenode host.Why
hdfs:///pathis well-formed.fs.defaultFSfromcore-site.xml". This is the standard idiom for portable Spark/Hive/MapReduce jobs that must not hard-code a namenode — same shape asfile:///etc/hosts.Asset("hdfs:///apps/x/file.parquet")at parse time.Change
providers/apache/hdfs/.../assets/hdfs.py: drop the "must contain a namenode host" check; keep the path-required check.providers/apache/hdfs/.../tests/.../test_hdfs.py:hdfs:///apps/myapp/...(empty netloc) — pass.hdfs://namenode:8020(no path) — fail.test_convert_asset_to_openlineage_default_fscovering OpenLineage emission with empty netloc.convert_asset_to_openlineagealready tolerates an empty netloc (f"hdfs://{parsed.netloc}"yieldshdfs://namespace), so no functional change there.Related
Gen-AI disclosure
This PR was prepared with Gen-AI assistance (Claude). I reviewed all generated code.